Abstract: It is now well established that sparse signal models are well suited 
to restoration tasks and can effectively be learned from audio, image, and video 
data. Recent research has been aimed at learning discriminative sparse models 
instead of purely reconstructive ones. This paper proposes a new step in that 
direction, with a novel sparse representation for signals belonging to different 
classes in terms of a shared dictionary and multiple class-decision functions. The 
linear variant of the proposed model admits a simple probabilistic interpretation, 
while its most general variant admits an interpretation in terms of kernels. An 
optimization framework for learning all the components of the proposed model 
is presented, along with experimental results on standard handwritten digit and 
texture classification tasks. 
Supervised dictionary learning

Summary: It is now well established that sparse signal representations are
well suited to restoration tasks for images, sounds, and video. Recent research
has aimed at learning discriminative representations instead of purely
reconstructive ones. This work proposes a new framework for representing
signals belonging to several different classes by simultaneously learning a
shared dictionary and multiple decision functions. We show that the linear
variant of this framework admits a simple probabilistic interpretation, while
the more general version can be interpreted in terms of kernels. We propose an
efficient optimization method and evaluate the model on handwritten digit
recognition and texture classification problems.
1 Introduction 

Sparse and overcomplete image models were first introduced in [13] for modeling
the spatial receptive fields of simple cells in the human visual system. The linear
decomposition of a signal using a few atoms of a learned dictionary, instead of
predefined ones such as wavelets, has recently led to state-of-the-art results for
numerous low-level image processing tasks such as denoising [5], showing that
sparse models are well adapted to natural images. Unlike principal component
analysis decompositions, these models are most often overcomplete, with a number
of basis elements greater than the dimension of the data. Recent research
has shown that sparsity helps to capture higher-order correlation in data: In
[9, 21], sparse decompositions are used with predefined dictionaries for face and
signal recognition. In [13], dictionaries are learned for a reconstruction task,
and the sparse decompositions are then used a posteriori within a classifier.
In [12], a discriminative method is introduced for various classification tasks,
learning one dictionary per class; the classification process itself is based on the
corresponding reconstruction error, and does not exploit the actual decomposition
coefficients. In [17], a generative model for document representation is
learned at the same time as the parameters of a deep network structure. The
framework we present in this paper extends these approaches by simultaneously
learning a single shared dictionary as well as multiple decision functions
for different signal classes in a mixed generative and discriminative formulation
(see also [18], where a different discrimination term is added to the classical
reconstructive one for supervised dictionary learning via class supervised
simultaneous orthogonal matching pursuit). Similar joint generative/discriminative
frameworks have started to appear in probabilistic approaches to learning, e.g.,
[2, 8, 10, 15, 19, 20], but not, to the best of our knowledge, in the sparse
dictionary learning framework. Section 2 presents the formulation and Section 3 its
interpretation in terms of probability and kernel frameworks. The optimization
procedure is detailed in Section 4, and experimental results are presented in
Section 5.

2 Supervised dictionary learning 

We present in this section the core of the proposed model. We start by describ- 
ing how to perform sparse coding in a supervised fashion, then show how to 
simultaneously learn a discriminative/reconstructive dictionary and a classifier. 

2.1 Supervised Sparse Coding 

In classical sparse coding tasks, one considers a signal $x$ in $\mathbb{R}^n$ and a fixed
dictionary $D = [d_1, \ldots, d_k]$ in $\mathbb{R}^{n \times k}$ (allowing $k > n$, making the
dictionary overcomplete). In this setting, sparse coding with an $\ell_1$ regularization
(see footnote 1) amounts to computing

$$\mathcal{R}^\star(x, D) = \min_{\alpha \in \mathbb{R}^k} \|x - D\alpha\|_2^2 + \lambda_1 \|\alpha\|_1. \tag{1}$$

Footnote 1: The $\ell_p$ regularization term of a vector $x$ for $p \geq 0$ is defined as
$\|x\|_p^p = \sum_{i=1}^n |x[i]|^p$. $\|\cdot\|_p$ is a norm when $p \geq 1$. When $p = 0$, it
counts the number of nonzero elements in the vector.
It is well known in the statistics, optimization, and compressed sensing
communities that the $\ell_1$ penalty yields a sparse solution (very few nonzero
coefficients in $\alpha$) [3], although there is no explicit analytic link between the
value of $\lambda_1$ and the effective sparsity that this model yields. Other sparsity
penalties using the $\ell_0$ (or more generally $\ell_p$) regularization can be used as well.
Since it uses a proper norm, the $\ell_1$ formulation of sparse coding is a convex
problem, which makes the optimization tractable with algorithms such as those
introduced in [4, 7], and has proven in our proposed framework to be more stable
than its $\ell_0$ counterpart, in the sense that the resulting decompositions are less
sensitive to small perturbations of the input signal $x$. Note that sparse coding
with an $\ell_0$ penalty is an NP-hard problem and is often approximated using greedy
algorithms.
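
As an illustration of Eq. (1), the following is a minimal sketch that minimizes the $\ell_1$-regularized objective with ISTA (iterative soft-thresholding). ISTA is a stand-in chosen here for brevity; it is not one of the solvers of [4, 7] used in the paper, and the function names, the problem sizes and the value of `lam1` are assumptions made for the example.

```python
import numpy as np

def soft_threshold(v, t):
    """Entrywise soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_objective(x, D, alpha, lam1):
    """Value of ||x - D alpha||_2^2 + lam1 * ||alpha||_1, as in Eq. (1)."""
    r = x - D @ alpha
    return float(r @ r + lam1 * np.abs(alpha).sum())

def sparse_code_ista(x, D, lam1, n_iter=500):
    """Approximately minimize Eq. (1) over alpha with ISTA (assumed solver)."""
    # Lipschitz constant of the gradient of ||x - D alpha||_2^2.
    L = 2.0 * np.linalg.norm(D, 2) ** 2
    alpha = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = -2.0 * D.T @ (x - D @ alpha)
        alpha = soft_threshold(alpha - grad / L, lam1 / L)
    return alpha

rng = np.random.default_rng(0)
n, k = 20, 50                               # overcomplete dictionary: k > n
D = rng.standard_normal((n, k))
D /= np.linalg.norm(D, axis=0)              # unit-norm atoms
x = D[:, :3] @ np.array([1.0, -2.0, 1.5])   # signal built from three atoms
alpha = sparse_code_ista(x, D, lam1=0.1)
print("nonzeros:", np.count_nonzero(alpha), "of", k)
print("objective:", lasso_objective(x, D, alpha, 0.1))
```

Each iteration takes a gradient step on the quadratic term and then shrinks the coefficients toward zero, which is how the $\ell_1$ penalty produces exact zeros in the code.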

In this paper, we consider a different setting, where the signal may belong to
any of $p$ different classes. We model the signal $x$ using a single shared dictionary
$D$ and a set of $p$ decision functions $g_i(x, \alpha, \theta)$ ($i = 1, \ldots, p$) acting on $x$ and
its sparse code $\alpha$ over $D$. The function $g_i$ should be positive for any signal in
class $i$ and negative otherwise. The vector $\theta$ parametrizes the model and will be
jointly learned with $D$. In the following, we will consider two kinds of decision
functions:

(i) linear in $\alpha$: $g_i(x, \alpha, \theta) = w_i^T \alpha + b_i$, where
$\theta = \{w_i \in \mathbb{R}^k, b_i \in \mathbb{R}\}_{i=1}^p$, and the vectors $w_i$ ($i = 1, \ldots, p$) can be
thought of as $p$ linear models for the coefficients $\alpha$, with the scalars $b_i$ acting
as biases;

(ii) bilinear in $x$ and $\alpha$: $g_i(x, \alpha, \theta) = x^T W_i \alpha + b_i$, where
$\theta = \{W_i \in \mathbb{R}^{n \times k}, b_i \in \mathbb{R}\}_{i=1}^p$. Note that the number of parameters in (ii)
is greater than in (i), which allows for richer models. One can interpret $W_i$ as a
filter encoding the input signal $x$ into a model for the coefficients $\alpha$, which has
a role similar to the encoder in [16] but for a discriminative task.
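
The two families of decision functions can be sketched as follows. The array layouts (the $w_i$ stored as columns of a `(k, p)` matrix `W`, the $W_i$ stacked in a `(p, n, k)` array `Ws`) and the function names are assumptions made for this example, not conventions fixed by the paper.

```python
import numpy as np

def g_linear(alpha, W, b):
    """Linear model (i): the p values w_i^T alpha + b_i.

    W has shape (k, p) with w_i as its i-th column; b has shape (p,).
    """
    return W.T @ alpha + b

def g_bilinear(x, alpha, Ws, b):
    """Bilinear model (ii): the p values x^T W_i alpha + b_i.

    Ws has shape (p, n, k), one matrix W_i per class; b has shape (p,).
    """
    return np.einsum("n,pnk,k->p", x, Ws, alpha) + b

n, k, p = 4, 6, 3
rng = np.random.default_rng(0)
x, alpha = rng.standard_normal(n), rng.standard_normal(k)
print(g_linear(alpha, rng.standard_normal((k, p)), np.zeros(p)))
print(g_bilinear(x, alpha, rng.standard_normal((p, n, k)), np.zeros(p)))
# Parameter counts: p * (k + 1) for (i) versus p * (n * k + 1) for (ii).
print(p * (k + 1), p * (n * k + 1))
```

The parameter counts printed at the end make the remark about (ii) concrete: the bilinear model has roughly $n$ times as many parameters as the linear one.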

Let us define softmax discriminative cost functions as

$$\mathcal{C}_i(x_1, \ldots, x_p) = \log\Big(\sum_{j=1}^p e^{x_j - x_i}\Big)$$

for $i = 1, \ldots, p$. These are multiclass versions of the logistic function, enjoying
properties similar to that of the hinge loss from the SVM literature, while being
differentiable. Given some input signal $x$ and fixed (for now) dictionary $D$ and
parameters $\theta$, the supervised sparse coding problem for the class $i$ can be defined
as computing

$$\mathcal{S}_i^\star(x, D, \theta) = \min_{\alpha} \mathcal{S}_i(\alpha, x, D, \theta), \tag{2}$$

where

$$\mathcal{S}_i(\alpha, x, D, \theta) = \mathcal{C}_i(\{g_j(x, \alpha, \theta)\}_{j=1}^p) + \lambda_0 \|x - D\alpha\|_2^2 + \lambda_1 \|\alpha\|_1. \tag{3}$$

Note the explicit incorporation of the classification and discriminative component
into sparse coding, in addition to the classical reconstructive term (see [18]
for a different classification component). In turn, any solution to this problem
provides a straightforward classification procedure, namely:

$$i^\star(x, D, \theta) = \operatorname*{argmin}_{i = 1, \ldots, p} \mathcal{S}_i^\star(x, D, \theta). \tag{4}$$
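
To make the pieces of Eq. (3) concrete, here is a minimal sketch that evaluates the softmax cost, written as $\log \sum_j e^{x_j - x_i}$, and the objective $\mathcal{S}_i$ for the linear decision functions. The function names and the `(k, p)` layout of `W` are assumptions for this example; the minimization over $\alpha$ of Eq. (2) is not performed here.

```python
import numpy as np

def softmax_cost(i, z):
    """log(sum_j exp(z_j - z_i)), computed stably by shifting by the maximum."""
    d = np.asarray(z, dtype=float) - z[i]
    m = d.max()
    return float(m + np.log(np.exp(d - m).sum()))

def supervised_objective(i, alpha, x, D, W, b, lam0, lam1):
    """Softmax cost of the linear decision values, plus the reconstruction
    and l1 terms of Eq. (3). W has shape (k, p), b has shape (p,)."""
    g = W.T @ alpha + b                      # decision values g_1, ..., g_p
    r = x - D @ alpha                        # reconstruction residual
    return softmax_cost(i, g) + lam0 * float(r @ r) + lam1 * float(np.abs(alpha).sum())

# With this sign convention the cost is smallest for the index whose
# argument is largest, and it never drops below zero.
z = np.array([2.0, -1.0, 0.5])
print([softmax_cost(i, z) for i in range(3)])
```

Shifting by the maximum before exponentiating avoids overflow when the decision values are large, without changing the value of the cost.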

Compared with earlier work using one dictionary per class [12], this model
has the advantage of letting multiple classes share some features, and uses the
coefficients $\alpha$ of the sparse representations as part of the classification procedure,
thereby following the works from [9, 21], but with learned representations
optimized for the classification task similar to [12, 18]. As shown in Section
3, this formulation has a straightforward probabilistic interpretation, but let us
first see how to learn the dictionary $D$ and the parameters $\theta$ from training data.

2.2 SDL: Supervised Dictionary Learning 

Let us assume that we are given $p$ sets of training data $T_i$, $i = 1, \ldots, p$, such
that all samples in $T_i$ belong to class $i$. The most direct method for learning $D$
and $\theta$ is to minimize with respect to these variables the mean value of $\mathcal{S}_i^\star$, with
an $\ell_2$ regularization term to prevent overfitting:

$$\min_{D, \theta} \Big(\sum_{i=1}^p \sum_{j \in T_i} \mathcal{S}_i^\star(x_j, D, \theta)\Big) + \lambda_2 \|\theta\|_2^2, \quad \text{s.t. } \forall\, i = 1, \ldots, k, \ \|d_i\|_2 \leq 1. \tag{5}$$

Since the reconstruction errors $\|x - D\alpha\|_2^2$ are invariant to simultaneously
scaling $D$ by a scalar and $\alpha$ by its inverse, constraining the $\ell_2$ norm of the columns
of $D$ prevents any transfer of energy between these two variables, which would
have the effect of overcoming the sparsity penalty. Such a constraint is classical
in sparse coding [3]. We will refer later to this model as SDL-G (supervised
dictionary learning, generative).
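
The role of the norm constraint can be illustrated with a short sketch: rescaling the dictionary and the code in opposite directions leaves the reconstruction unchanged while shrinking the $\ell_1$ penalty, and projecting the columns onto the unit ball rules this out. The helper name `project_columns` and the scale factor are assumptions for the example.

```python
import numpy as np

def project_columns(D):
    """Euclidean projection of each column of D onto the unit l2 ball."""
    norms = np.linalg.norm(D, axis=0)
    return D / np.maximum(norms, 1.0)

rng = np.random.default_rng(0)
D = rng.standard_normal((4, 6))
alpha = rng.standard_normal(6)
c = 10.0

# Scaling D by c and alpha by 1/c leaves the reconstruction D @ alpha unchanged.
print(np.allclose(D @ alpha, (c * D) @ (alpha / c)))
# The l1 penalty, however, shrinks by the factor c.
print(np.abs(alpha).sum(), np.abs(alpha / c).sum())
# After projection, no column has norm larger than 1.
print(np.linalg.norm(project_columns(c * D), axis=0))
```

Columns that already satisfy the constraint are left untouched by the projection, so it only acts when an update has pushed an atom outside the unit ball.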

Nevertheless, since the classification procedure from Eq. (4) will compare the
different residuals $\mathcal{S}_i^\star$ of a given signal for $i = 1, \ldots, p$, a more discriminative
approach is to not only make the $\mathcal{S}_i^\star$ small for signals with label $i$, as in (5), but
also make the value of $\mathcal{S}_j^\star$ greater than $\mathcal{S}_i^\star$ for $j$ different from $i$, which is the
purpose of the softmax function $\mathcal{C}_i$. This leads to:

$$\min_{D, \theta} \Big(\sum_{i=1}^p \sum_{j \in T_i} \mathcal{C}_i(\{\mathcal{S}_l^\star(x_j, D, \theta)\}_{l=1}^p)\Big) + \lambda_2 \|\theta\|_2^2 \quad \text{s.t. } \forall\, i = 1, \ldots, k, \ \|d_i\|_2 \leq 1. \tag{6}$$

As detailed below, this problem is more difficult to solve than Eq. (5), and
therefore we adopt instead a mixed formulation between the minimization of
the generative Eq. (5) and its discriminative version (6) [15], that is,

$$\min_{D, \theta} \Big(\sum_{i=1}^p \sum_{j \in T_i} \mu\, \mathcal{C}_i(\{\mathcal{S}_l^\star(x_j, D, \theta)\}_{l=1}^p) + (1 - \mu)\, \mathcal{S}_i^\star(x_j, D, \theta)\Big) + \lambda_2 \|\theta\|_2^2 \quad \text{s.t. } \forall\, i, \ \|d_i\|_2 \leq 1, \tag{7}$$

where $\mu$ controls the trade-off between reconstruction from Eq. (5) and
discrimination from Eq. (6). This is the proposed generative/discriminative model for
sparse signal representation and classification from learned dictionary $D$ and
model $\theta$. We will refer to this mixed model as SDL-D (supervised dictionary
learning, discriminative).

Before presenting the proposed optimization procedure, we provide below 
two interpretations of the linear and bilinear versions of our formulation in 
terms of a probabilistic graphical model and a kernel. 
3 Interpreting the model 

3.1 A probabilistic interpretation of the linear model 

Let us first construct a graphical model which gives a probabilistic interpretation
to the training and classification criteria given above when using a linear model
with zero bias (no constant term) on the coefficients, that is, $g_i(x, \alpha, \theta) = w_i^T \alpha$.
This model consists of the following components (Figure 1):

• The matrices $D$ and $W$ are parameters of the problem, with a Gaussian prior
on $W$, $p(W) \propto e^{-\lambda_2 \|W\|_2^2}$, and on the columns of $D$,
$p(D) \propto \prod_{l=1}^k e^{-\gamma_l \|d_l\|_2^2}$, where the $\gamma_l$'s are the Gaussian parameters.
All the $d_l$'s are considered independent of each other.

• The coefficients $\alpha_j$ are latent variables with a Laplace prior,
$p(\alpha_j) \propto e^{-\lambda_1 \|\alpha_j\|_1}$.

• The signals $x_j$ are generated according to a Gaussian probability distribution
conditioned on $D$ and $\alpha_j$, $p(x_j | \alpha_j, D) \propto e^{-\lambda_0 \|x_j - D\alpha_j\|_2^2}$. All the $x_j$'s are
considered independent from each other.

• The labels $y_j$ are generated according to a probability distribution conditioned
on $W$ and $\alpha_j$, and given by
$p(y_j = i | \alpha_j, W) = e^{-w_i^T \alpha_j} / \sum_{l=1}^p e^{-w_l^T \alpha_j}$. Given
$D$ and $W$, all the triplets $(\alpha_j, x_j, y_j)$ are independent.

What is commonly called "generative training" in the literature (e.g., [10, 15])
amounts to finding the maximum likelihood for $D$ and $W$ according to
the joint distribution $p(\{x_j, y_j\}_{j=1}^m, D, W)$, where the $x_j$'s and the $y_j$'s are
respectively the training signals and their labels. It can easily be shown (details
omitted due to space limitations) that there is an equivalence between this
generative training and our formulation in Eq. (5) under MAP approximations (see
footnote 2). Although joint generative modeling of $x$ and $y$ through a shared
representation, e.g., [5], has shown great promise, we show in this paper that a more
discriminative approach is desirable. "Discriminative training" is slightly
different and amounts to maximizing $p(\{y_j\}_{j=1}^m, D, W | \{x_j\}_{j=1}^m)$ with respect to
$D$ and $W$: Given some input data, one finds the best parameters that will
predict the labels of the data. The same kind of MAP approximation relates
this discriminative training formulation to the discriminative model of Eq. (6)
(again, details omitted due to space limitations). The mixed approach from Eq.
(7) is a classical trade-off between generative and discriminative approaches
(e.g., [10, 15]), where generative components are often added to discriminative
frameworks to add robustness, e.g., to noise and occlusions (see examples of this
for the model in [18]).

Footnote 2: We are also investigating how to properly estimate $D$ by marginalizing
over $\alpha$ instead of maximizing with respect to that parameter.



3.2 A kernel interpretation of the bilinear model 

Our bilinear model with $g_i(x, \alpha, \theta) = x^T W_i \alpha + b_i$ does not admit a
straightforward probabilistic interpretation. On the other hand, it can easily be
interpreted in terms of kernels: Given two signals $x_1$ and $x_2$, with coefficients $\alpha_1$ and
$\alpha_2$, using the kernel $K(x_1, x_2) = \alpha_1^T \alpha_2\, x_1^T x_2$ in a logistic regression classifier
amounts to finding a decision function of the same form as (ii). It is a product of
two linear kernels, one on the $\alpha$'s and one on the input signals $x$. Interestingly,
Raina et al. [14] learn a dictionary adapted to reconstruction on a training set,
then train an SVM a posteriori on the decomposition coefficients $\alpha$. They derive
and use a Fisher kernel, which can be written as $K'(x_1, x_2) = \alpha_1^T \alpha_2\, r_1^T r_2$ in this
setting, where the $r$'s are the residuals of the decompositions. Experimentally,
we have observed that the kernel $K$, where the signals $x$ replace the residuals
$r$, generally yields a level of performance similar to $K'$, and often actually does
better when the number of training samples is small or the data are noisy.
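
The link between the product kernel and the bilinear decision function can be checked numerically: the kernel is the inner product of the outer-product features $x\alpha^T$, and $x^T W \alpha$ is a linear function of those features. The sketch below assumes signals and codes are stored row-wise; the function names are hypothetical.

```python
import numpy as np

def product_kernel(X, A):
    """Gram matrix of K(x_1, x_2) = (alpha_1^T alpha_2) (x_1^T x_2).

    X has shape (m, n), one signal per row; A has shape (m, k), the matching codes.
    """
    return (A @ A.T) * (X @ X.T)

def outer_features(X, A):
    """Explicit feature map phi(x, alpha) = vec(x alpha^T), shape (m, n * k)."""
    return np.einsum("mn,mk->mnk", X, A).reshape(X.shape[0], -1)

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 4))    # 5 signals of dimension n = 4
A = rng.standard_normal((5, 3))    # their codes of dimension k = 3
F = outer_features(X, A)
print(np.allclose(product_kernel(X, A), F @ F.T))

# A linear function of phi with weights vec(W) is the bilinear form x^T W alpha.
W = rng.standard_normal((4, 3))
print(np.isclose(X[0] @ W @ A[0], W.reshape(-1) @ F[0]))
```

Because the Gram matrix is an entrywise product of two linear-kernel Gram matrices, it can be formed without ever materializing the $n \times k$ feature vectors.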



4 Optimization procedure 

Classical dictionary learning techniques (e.g., [1, 13, 14]) address the problem
of learning a reconstructive dictionary $D$ in $\mathbb{R}^{n \times k}$ well adapted to a training set
$T$ as

$$\min_{D, \alpha} \sum_{i \in T} \|x_i - D\alpha_i\|_2^2 + \lambda_1 \|\alpha_i\|_1, \tag{8}$$

which is not jointly convex in $(D, \alpha)$, but convex with respect to each unknown
when the other one is fixed. This is why block coordinate descent on $D$ and
$\alpha$ performs reasonably well [1, 13, 14], although not necessarily providing the
global optimum. Training when $\mu = 0$ (generative case), i.e., from Eq. (5),
enjoys similar properties and can be addressed with the same optimization
procedure. Equation (5) can be rewritten as:

$$\min_{D, \theta, \alpha} \Big(\sum_{i=1}^p \sum_{j \in T_i} \mathcal{S}_i(\alpha_j, x_j, D, \theta)\Big) + \lambda_2 \|\theta\|_2^2, \quad \text{s.t. } \forall\, i = 1, \ldots, k, \ \|d_i\|_2 \leq 1. \tag{9}$$

Block coordinate descent consists therefore of iterating between supervised sparse
coding, where $D$ and $\theta$ are fixed and one optimizes with respect to the $\alpha$'s, and
supervised dictionary update, where the coefficients $\alpha_j$'s are fixed, but $D$ and $\theta$
are updated. Details on how to solve these two problems are given in Sections
4.1 and 4.2.
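
The alternation can be sketched on the purely reconstructive problem of Eq. (8), which has the same block structure as the generative case. This is an illustration under stated assumptions, not the procedure used in the paper: the sparse coding step uses ISTA, the dictionary step uses fixed-step projected gradient, and all names, sizes and iteration counts are chosen for the example.

```python
import numpy as np

def soft_threshold(v, t):
    """Entrywise soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def objective(X, D, A, lam1):
    """Sum over signals of ||x_i - D alpha_i||_2^2 + lam1 ||alpha_i||_1, as in Eq. (8)."""
    R = X - D @ A
    return float((R ** 2).sum() + lam1 * np.abs(A).sum())

def sparse_coding_step(X, D, A, lam1, n_iter=50):
    """ISTA on all columns of A at once, with the dictionary D fixed."""
    L = 2.0 * np.linalg.norm(D, 2) ** 2 + 1e-12
    for _ in range(n_iter):
        A = soft_threshold(A + 2.0 * D.T @ (X - D @ A) / L, lam1 / L)
    return A

def dictionary_step(X, D, A, n_iter=50):
    """Projected gradient on D with the codes A fixed, keeping ||d_l||_2 <= 1."""
    L = 2.0 * np.linalg.norm(A, 2) ** 2 + 1e-12
    for _ in range(n_iter):
        D = D + 2.0 * (X - D @ A) @ A.T / L
        D = D / np.maximum(np.linalg.norm(D, axis=0), 1.0)
    return D

def learn_dictionary(X, k, lam1, n_outer=10, seed=0):
    """Alternate the two convex subproblems; X holds one signal per column."""
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((X.shape[0], k))
    D /= np.linalg.norm(D, axis=0)
    A = np.zeros((k, X.shape[1]))
    history = []
    for _ in range(n_outer):
        A = sparse_coding_step(X, D, A, lam1)
        D = dictionary_step(X, D, A)
        history.append(objective(X, D, A, lam1))
    return D, A, history

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 40))
D, A, history = learn_dictionary(X, k=12, lam1=0.2, n_outer=5)
print(history)
```

Each block update is a descent step on a convex subproblem, so the recorded objective values do not increase from one outer iteration to the next, even though the joint problem is not convex.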

The discriminative version of SDL from Eq. (6) is more problematic. The
minimization of the term $\mathcal{C}_i(\{\mathcal{S}_l(\alpha_{jl}, x_j, D, \theta)\}_{l=1}^p)$ with respect to $D$ and $\theta$
when the $\alpha_{jl}$'s are fixed, is not convex in general, and does not necessarily
decrease the first term of Eq. (6), i.e., $\mathcal{C}_i(\{\mathcal{S}_l^\star(x_j, D, \theta)\}_{l=1}^p)$. To reach a
local minimum for this difficult problem, we have chosen a continuation method,
starting from the generative case and ending with the discriminative one as in
[12]. The algorithm is presented in Figure 2 and details on the hyperparameters'
settings are given in Section 5.

Input: $p$ (number of classes); $n$ (signal dimensions); $T_1, \ldots, T_p$ (training
signals); $k$ (size of the dictionary); $\lambda_0, \lambda_1, \lambda_2$ (parameters);
$0 \leq \mu_1 \leq \mu_2 \leq \cdots \leq \mu_m \leq 1$ (increasing sequence).
Output: $D \in \mathbb{R}^{n \times k}$ (dictionary); $\theta$ (parameters).
Initialization: Set $D$ to a random Gaussian matrix. Set $\theta$ to zero.
Loop: For $\mu = \mu_1, \ldots, \mu_m$,

  Loop: Repeat until convergence (or a fixed number of iterations),

  • Supervised sparse coding: Compute, for all $i = 1, \ldots, p$, all $j$ in $T_i$, and
    all $l = 1, \ldots, p$,

    $$\alpha^\star_{jl} = \operatorname*{argmin}_{\alpha \in \mathbb{R}^k} \mathcal{S}_l(\alpha, x_j, D, \theta). \tag{10}$$

  • Dictionary update: Solve, under the constraint $\|d_l\|_2 \leq 1$ for all
    $l = 1, \ldots, k$,

    $$\min_{D, \theta} \Big(\sum_{i=1}^p \sum_{j \in T_i} \mu\, \mathcal{C}_i(\{\mathcal{S}_l(\alpha^\star_{jl}, x_j, D, \theta)\}_{l=1}^p) + (1 - \mu)\, \mathcal{S}_i(\alpha^\star_{ji}, x_j, D, \theta)\Big) + \lambda_2 \|\theta\|_2^2. \tag{11}$$

Figure 2: SDL: Supervised dictionary learning algorithm.

4.1 Supervised sparse coding 

The supervised sparse coding problem from Eq. (10) ($D$ and $\theta$ are fixed in
this step) amounts to minimizing a convex function under an $\ell_1$ penalty. The
fixed-point continuation method (FPC) from [7] achieves state-of-the-art results
in terms of convergence speed for this class of problems. It has proven in our
experiments to be simple, efficient, and well adapted to our supervised sparse
coding problem. Algorithmic details are given in [7]. For our specific problem,
denoting by $f$ the convex function to minimize, this method only requires $\nabla f$
and a bound on the spectral norm of its Hessian $H_f$. Since we have chosen
decision functions $g_i$ in Eq. (10) which are linear in $\alpha$, there exists, for each
signal $x$ to be sparsely represented, a matrix $A$ in $\mathbb{R}^{k \times p}$ and a vector $b$ in $\mathbb{R}^p$
such that

$$f(\alpha) = \mathcal{C}_i(A^T \alpha + b) + \lambda_0 \|x - D\alpha\|_2^2,$$

$$\nabla f(\alpha) = A \nabla \mathcal{C}_i(A^T \alpha + b) - 2\lambda_0 D^T (x - D\alpha),$$

and it can be shown that, if $\|U\|_2$ denotes the spectral norm of a matrix
$U$ (which is the magnitude of its largest eigenvalue), then
$\|H_f\|_2 \leq (1 - \frac{1}{p}) \|A^T A\|_2 + 2\lambda_0 \|D^T D\|_2$. In the case where $p = 2$ (only two
classes), we can obtain a tighter bound,
$\|H_f(\alpha)\|_2 \leq e^{-\mathcal{C}_1(A^T \alpha + b) - \mathcal{C}_2(A^T \alpha + b)} \|a_2 - a_1\|_2^2 + 2\lambda_0 \|D^T D\|_2$,
where $a_1$ and $a_2$ are the first and second columns of $A$.
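
The gradient formula and the general Hessian bound can be sanity-checked numerically. The sketch below assumes the softmax cost $\log \sum_j e^{z_j - z_i}$, uses random data of arbitrary size, and estimates the Hessian by finite differences of the analytic gradient; the function names are hypothetical.

```python
import numpy as np

def softmax_cost_and_grad(i, z):
    """Return log(sum_j exp(z_j - z_i)) and its gradient with respect to z."""
    d = z - z[i]
    m = d.max()
    e = np.exp(d - m)
    s = e / e.sum()            # softmax probabilities of z
    grad = s.copy()
    grad[i] -= 1.0             # derivative in z_l is s_l - 1[l == i]
    return float(m + np.log(e.sum())), grad

def f_and_grad(alpha, i, x, D, A, b, lam0):
    """Smooth part f(alpha): softmax cost of A^T alpha + b plus lam0 ||x - D alpha||_2^2."""
    c, gc = softmax_cost_and_grad(i, A.T @ alpha + b)
    r = x - D @ alpha
    return c + lam0 * float(r @ r), A @ gc - 2.0 * lam0 * (D.T @ r)

def hessian_bound(A, D, lam0):
    """(1 - 1/p) ||A^T A||_2 + 2 lam0 ||D^T D||_2, with ||.||_2 the spectral norm."""
    p = A.shape[1]
    return (1.0 - 1.0 / p) * np.linalg.norm(A.T @ A, 2) + 2.0 * lam0 * np.linalg.norm(D.T @ D, 2)

rng = np.random.default_rng(0)
n, k, p, lam0 = 5, 6, 3, 0.7
D, A = rng.standard_normal((n, k)), rng.standard_normal((k, p))
b, x, alpha = rng.standard_normal(p), rng.standard_normal(n), rng.standard_normal(k)

# Central finite differences of the gradient give a numerical Hessian of f.
eps = 1e-6
H = np.array([(f_and_grad(alpha + eps * e, 1, x, D, A, b, lam0)[1]
               - f_and_grad(alpha - eps * e, 1, x, D, A, b, lam0)[1]) / (2 * eps)
              for e in np.eye(k)])
print("||H_f||_2 =", np.linalg.norm(H, 2), "<= bound =", hessian_bound(A, D, lam0))
```

A bound of this kind is what a first-order method needs to pick a safe step size, since the inverse of the Hessian bound is a valid step for the smooth part of the objective.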
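4.2 Dictionary update 

The problem of updating $D$ and $\theta$ in Eq. (11) is not convex in general (except
when $\mu$ is close to 0), but a local minimum can be obtained using projected
gradient descent (as in the general literature on dictionary learning, this local
minimum has experimentally been found to be good enough for our formulation).
Denoting $E(D, \theta)$ the function we want to minimize in Eq. (11), we just need
the partial derivatives of $E$ with respect to $D$ and the parameters $\theta$. Details
when using the linear model for the $\alpha$'s, $g_i(x, \alpha, \theta) = w_i^T \alpha + b_i$, and
$\theta = \{W \in \mathbb{R}^{k \times p}, b \in \mathbb{R}^p\}$, are

$$\begin{aligned}
\frac{\partial E}{\partial D} &= -2\lambda_0 \sum_{i=1}^p \sum_{j \in T_i} \sum_{l=1}^p \omega_{jl}\, (x_j - D\alpha^\star_{jl})\, \alpha^{\star T}_{jl}, \\
\frac{\partial E}{\partial W} &= \sum_{i=1}^p \sum_{j \in T_i} \sum_{l=1}^p \omega_{jl}\, \alpha^\star_{jl}\, \nabla \mathcal{C}_l^T(W^T \alpha^\star_{jl} + b), \\
\frac{\partial E}{\partial b} &= \sum_{i=1}^p \sum_{j \in T_i} \sum_{l=1}^p \omega_{jl}\, \nabla \mathcal{C}_l(W^T \alpha^\star_{jl} + b),
\end{aligned} \tag{12}$$

where

$$\omega_{jl} = \mu\, \nabla \mathcal{C}_i(\{\mathcal{S}_m(\alpha^\star_{jm}, x_j, D, \theta)\}_{m=1}^p)[l] + (1 - \mu)\, \mathbf{1}_{l = i}. \tag{13}$$

Partial derivatives when using our model with the bilinear decision functions
$g_i(x, \alpha, \theta) = x^T W_i \alpha + b_i$ are not given in this paper because of space
limitations.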



5 Experimental validation 

We compare in this section a reconstructive approach, dubbed REC, which
consists of learning a reconstructive dictionary $D$ as in [14] and then learning the
parameters $\theta$ a posteriori; SDL with generative training (dubbed SDL-G); and
SDL with discriminative learning (dubbed SDL-D). We also compare the
performance of the linear (L) and bilinear (BL) decision functions.

Before presenting experimental results, let us briefly discuss the choice of the
five model parameters $\lambda_0$, $\lambda_1$, $\lambda_2$, $\mu$ and $k$ (size of the dictionary). Tuning all of
them using cross-validation is cumbersome and unnecessary since some simple
choices can be made, some of which can be done sequentially. We define first
the sparsity parameter $\kappa = \lambda_1 / \lambda_0$, which dictates how sparse the decompositions
are. When the input data points have unit $\ell_2$ norm, choosing $\kappa = 0.15$ was
empirically found to be a good choice. The number of parameters to learn is
linear in k, the number of elements in the dictionary D. For reconstructive tasks, 
k = 256 is a typical value often used in the literature (e.g., pQ). Nevertheless, 
for discriminative tasks, increasing the number of parameters is likely to allow
overfitting, and smaller values like $k = 64$ or $k = 32$ are preferred. The scalar $\lambda_2$
is a regularization parameter that prevents the model from overfitting the input data.
As in logistic regression or support vector machines, this parameter is crucial
when the number of training samples is small. Performing cross-validation with
the fast method REC quickly provides a reasonable value for this parameter,
which can be used afterward for SDL-G or SDL-D.
Once $\kappa$, $k$ and $\lambda_2$ are chosen, let us see how to find $\lambda_0$. In logistic regression,
a projection matrix maps input data onto a softmax function, and its shape and
scale are adapted so that it becomes discriminative according to an underlying
probabilistic model. In the model we are proposing, the functions $\mathcal{S}_i^\star$ are also
mapped onto a softmax function, and the parameters $D$ and $\theta$ are adapted
(learned) in such a way that $\mathcal{S}_i^\star$ becomes discriminative. However, for a fixed $\kappa$,
the second and third terms of $\mathcal{S}_i$, namely $\lambda_0 \|x - D\alpha\|_2^2$ and $\lambda_0 \kappa \|\alpha\|_1$, are not
freely scalable when adapting $D$ and $\theta$, since their magnitudes are bounded.
$\lambda_0$ plays the important role of controlling the trade-off between reconstruction
and discrimination in Eq. (3). First, we perform cross-validation for a few
iterations with $\mu = 0$ to find a good value for SDL-G. Then, a scale factor
making the $\mathcal{S}_i^\star$'s discriminative for $\mu > 0$ can be chosen during the optimization
process: Given a set of $\mathcal{S}_i^\star$'s, one can compute a scale factor $\gamma^\star$ such that
$\gamma^\star = \operatorname{argmin}_\gamma \sum_{i=1}^p \sum_{j \in T_i} \mathcal{C}_i(\{\gamma \mathcal{S}_l^\star(x_j, D, \theta)\}_{l=1}^p)$. We therefore propose the following
strategy, which has proven to be efficient during our experiments: Starting from
small values for $\lambda_0$ and a fixed $\kappa$, we apply the algorithm in Figure 2, and after
a supervised sparse coding step, we compute the best scale factor $\gamma^\star$, and replace
$\lambda_0$ and $\lambda_1$ by $\gamma^\star \lambda_0$ and $\gamma^\star \lambda_1$. Typically, applying this procedure during the first
10 iterations has proven to lead to reasonable values for this parameter.

Since we are following a continuation path starting from $\mu = 0$ to $\mu = 1$,
the optimal value of $\mu$ is found along the path by measuring the classification
performance of the model on a validation set during the optimization.

5.1 Digits recognition 

In this section, we present experiments on the popular MNIST [11] and USPS
handwritten digit datasets. MNIST is composed of 70 000 images of 28 x 28
pixels, 60 000 for training, 10 000 for testing, each of them containing a
handwritten digit. USPS is composed of 7291 training images and 2007 test images.
As is often done in classification, we have chosen to learn pairwise binary
classifiers, one for each pair of digits. Although we have presented a multiclass
framework, pairwise binary classifiers have proven to offer a slightly better
performance in practice. Five-fold cross-validation has been performed to find
the best pair $(k, \kappa)$. The tested values for $k$ are {24, 32, 48, 64, 96}, and for $\kappa$,
{0.13, 0.14, 0.15, 0.16, 0.17}. Then, we have kept the three best pairs of
parameters and used them to train three sets of pairwise classifiers. For a given patch
$x$, the test procedure consists of selecting the class which receives the most votes
from the pairwise classifiers. All the other parameters are obtained using the
procedure explained above. Classification results are presented in Table 1 when
using the linear model. We see that for the linear model L, SDL-D L performs
the best. REC BL offers a larger feature space and performs better than REC
L. Nevertheless, we have observed no gain by using SDL-G BL or SDL-D BL
instead of REC BL. Since the linear model is already performing very well, one
side effect of using BL instead of L is to increase the number of free parameters
and thus to cause overfitting. Note that the best error rates published on
these datasets (without any modification of the training set) are 0.60% [16] for
MNIST and 2.4% [6] for USPS, using methods tailored to these tasks, whereas
ours is generic and has not been tuned to the handwritten digit classification
domain.
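
The voting rule used at test time can be sketched as follows. The callable `pairwise_winner` stands in for the trained pairwise classifiers and is an assumption of this example, as are the toy scores used in the demonstration.

```python
from itertools import combinations

def vote(pairwise_winner, classes):
    """Return the class receiving the most votes from all pairwise classifiers.

    pairwise_winner(a, b) must return either a or b.
    """
    votes = {c: 0 for c in classes}
    for a, b in combinations(classes, 2):
        votes[pairwise_winner(a, b)] += 1
    # Ties are broken in favor of the class listed first.
    return max(classes, key=lambda c: votes[c])

# Toy stand-in for the classifiers: each digit has a hypothetical score and
# the pairwise winner is the digit with the higher score.
scores = {d: -abs(d - 7) for d in range(10)}
digits = list(range(10))
print(vote(lambda a, b: a if scores[a] > scores[b] else b, digits))
print(len(list(combinations(digits, 2))), "pairwise classifiers for 10 digits")
```

With ten digits, this scheme requires one classifier per unordered pair of classes, which is why the number of models grows quadratically with the number of classes.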
         REC L   SDL-G L   SDL-D L   REC BL   k-NN, $\ell_2$   SVM-Gauss
MNIST    4.33    3.56      1.05      3.41     5.0        1.4
USPS     6.83    6.67      3.54      4.38     5.2        4.2

Table 1: Error rates on MNIST and USPS datasets in percent from the REC,
SDL-G L and SDL-D L approaches, compared with k-nearest neighbor and SVM
with a Gaussian kernel [11].



The purpose of our second experiment is not to measure the raw performance
of our algorithm, but to answer the question "are the obtained dictionaries $D$
discriminative per se or is the pair $(D, \theta)$ discriminative?". To do so, we have
trained on the USPS dataset 10 binary classifiers, one per digit in a one-vs-all
fashion on the training set. For a given value of $\mu$, we obtain 10 dictionaries $D$
and 10 sets of parameters $\theta$, learned by the SDL-D L model.

To evaluate the discriminative power of the dictionaries $D$, we discard the
learned parameters $\theta$ and use the dictionaries as if they had been learned in
a reconstructive REC model: For each dictionary, we decompose each image
from the training set by solving the simple sparse reconstruction problem from
Eq. (1) instead of using supervised sparse coding. This provides us with some
coefficients $\alpha$, which we use as features in a linear SVM. Repeating the sparse
decomposition procedure on the test set permits us to evaluate the performance
of these learned linear SVMs. We plot the average error rate of these classifiers
in Figure 3 for each value of $\mu$. We see that using the dictionaries obtained
with discriminative learning ($\mu > 0$, SDL-D L) dramatically improves the
performance of the basic linear classifier learned a posteriori on the $\alpha$'s, showing that
our learned dictionaries are discriminative per se. Figure 4 shows a dictionary
adapted to the reconstruction of the MNIST dataset and a discriminative one,
adapted to "9 vs all".




Figure 3: Average error rate in percent obtained by our dictionaries learned
in a discriminative framework (SDL-D L) for various values of $\mu$, when
used at test time in a reconstructive framework (REC-L). See text for details.

(a) REC, MNIST (b) SDL-D, MNIST

Figure 4: On the left, a reconstructive dictionary; on the right, a discriminative
one for the task "9 vs all".

5.2 Texture classification 

In the digit recognition task, our BL bilinear framework did not perform better
than L, and we believe that one of the main reasons is the simplicity of the
task, where a linear model is rich enough. The purpose of our next experiment
is to answer the question "When is BL worth using?". We have chosen to
consider two texture images from the Brodatz dataset, presented in Figure 5,
and to build two classes, composed of 12 x 12 patches taken from these two
textures. We have compared the classification performance of all our methods,
including BL, for a dictionary of size $k = 64$ and $\kappa = 0.15$. The training set
was composed of patches from the left half of each texture and the test sets
of patches from the right half, so that the training and test sets do not
overlap. Error rates are reported in Table 2 for varying sizes of the training
set. This experiment shows that in some cases, the linear model completely
fails and BL is necessary. Discrimination is particularly valuable for large
training sets. Note that we did not perform any cross-validation to optimize the
parameters $k$ and $\kappa$ for this experiment. Dictionaries obtained with REC and
SDL-D BL are presented in Figure 5. Note that though they are visually quite
similar, they lead to very different performance.

M       REC L   SDL-G L   SDL-D L   REC BL   SDL-G BL   SDL-D BL   Gain
300     48.84   47.34     44.84     26.34    26.34      26.34      0%
1500    46.8    46.3      42        22.7     22.3       22.3       2%
3000    45.17   45.1      40.6      21.99    21.22      21.22      4%
6000    45.71   43.68     39.77     19.77    18.75      18.61      6%
15000   47.54   46.15     38.99     18.2     17.26      15.48      15%
30000   47.28   45.1      38.3      18.99    16.84      14.26      25%

Table 2: Error rates for the texture classification task using various frameworks
and sizes M of training set. The last column indicates the gain between the
error rate of REC BL and SDL-D BL.
(a) Texture 1 (b) Texture 2 
(c) REC (d) SDL-D BL 

Figure 5: Top: Test textures. Bottom left: reconstructive dictionary. Bottom 
right: discriminative dictionary. 
6 Conclusion 

We have introduced in this paper a discriminative approach to supervised dictio- 
nary learning that effectively exploits the corresponding sparse signal decompo- 
sitions in image classification tasks, and affords an effective method for learning 
a shared dictionary and multiple (linear or bilinear) decision functions. Future 
work will be devoted to adapting the proposed framework to shift-invariant 
models that are standard in image processing tasks, but not readily generalized 
to the sparse dictionary learning setting. We are also investigating extensions to 
unsupervised and semi-supervised learning and applications to natural image 
classification. 

References 

[1] M. Aharon, M. Elad, and A. M. Bruckstein. The K-SVD: An algorithm for 
designing of overcomplete dictionaries for sparse representations. IEEE Trans. 
SP, 54(11):4311-4322, November 2006. 

[2] D. Blei and J. McAuliffe. Supervised topic models. In Adv. NIPS, 2007. 

[3] D. L. Donoho. Compressed sensing. IEEE Trans. IT, 52(4):1289-1306, April 
2006. 

[4] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Ann. 
Statist, 32(2):407-499, 2004. 

[5] M. Elad and M. Aharon. Image denoising via sparse and redundant 
representations over learned dictionaries. IEEE Trans. IP, 15(12):3736-3745, December 
2006. 

[6] B. Haasdonk and D. Keysers. Tangent distance kernels for support vector machines. 
In Proc. ICPR, 2002. 

[7] E. T. Hale, W. Yin, and Y. Zhang. A fixed-point continuation method for 
l1-regularized minimization with applications to compressed sensing. Technical 
report, Rice University, 2007. CAAM Technical Report TR07-07, 
http://www.caam.rice.edu/~optimization/L1/fpc/. 

[8] A. Holub and P. Perona. A discriminative framework for modeling object classes. 
In Proc. IEEE CVPR, 2005. 

[9] K. Huang and S. Aviyente. Sparse representation for signal classification. In Adv. 
NIPS, 2006. 

[10] J. A. Lasserre, C. M. Bishop, and T. P. Minka. Principled hybrids of generative 
and discriminative models. In Proc. IEEE CVPR, 2006. 

[11] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied 
to document recognition. Proceedings of the IEEE, 86(11):2278-2324, November 
1998. 

[12] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Learning discriminative 
dictionaries for local image analysis. In Proc. IEEE CVPR, 2008. 

[13] B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: 
A strategy employed by V1? Vision Research, 37:3311-3325, 1997. 

[14] R. Raina, A. Battle, H. Lee, B. Packer, and A. Y. Ng. Self-taught learning: 
transfer learning from unlabeled data. In ICML, 2007. 

[15] R. Raina, Y. Shen, A. Y. Ng, and A. McCallum. Classification with hybrid 
generative/discriminative models. In Adv. NIPS, 2004. 






[16] M. Ranzato, C. Poultney, S. Chopra, and Y. LeCun. Efficient learning of sparse 
representations with an energy-based model. In Adv. NIPS, 2006. 

[17] M. Ranzato and M. Szummer. Semi-supervised learning of compact document 
representations with deep networks. In ICML, 2008. 

[18] F. Rodriguez and G. Sapiro. Sparse representations for image classification: 
Learning discriminative and reconstructive non-parametric dictionaries. Tech- 
nical report, University of Minnesota, December 2007. IMA Preprint 2213. 

[19] R. R. Salakhutdinov and G. E. Hinton. Learning a non-linear embedding by 
preserving class neighbourhood structure. In AI and Statistics, 2007. 

[20] J. Winn, A. Criminisi, and T. Minka. Object categorization by learned universal 
visual dictionary. In Proc. IEEE ICCV, 2005. 

[21] J. Wright, A. Y. Yang, A. Ganesh, S. Sastry, and Y. Ma. Robust face 
recognition via sparse representation. IEEE Trans. PAMI, 2008. To appear, 
http://perception.csl.uiuc.edu/recognition/Home.html. 



RR n° 6652 




Centre de recherche INRIA Paris - Rocquencourt 
Domaine de Voluceau - Rocquencourt - BP 105 - 78153 Le Chesnay Cedex (France) 




Publisher 

INRIA - Domaine de Voluceau - Rocquencourt, BP 105 - 78153 Le Chesnay Cedex (France) 

http://www.inria.fr 
ISSN 0249-6399 



